[DO NOT MERGE] Upstream codebase diff #470

Draft: wants to merge 1,008 commits into main

Conversation

@kzawora-intel commented Nov 6, 2024

Scope of changes:

  • Contiguous PA
  • Multi-step scheduling
  • Automatic prefix caching
  • Padding-aware scheduling/max_num_prefill_seqs
  • Guided decoding fixes
  • FP8 support (INC/w8a8/weights_load_device)
  • ApplyToppTopkScalar sampler optimization
  • LoRA/MultiLoRA support
  • FusedMoE support
  • Model changes (adding mark_steps)
  • Tests
  • FakeHPU mode
  • CI stuff (.jenkins, .github)
  • Lots of minor stuff (RNG, FSDPA flag, reduced block fragmentation)

@@ -0,0 +1,35 @@
name: cpu-test

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
@kzawora-intel marked this pull request as draft on November 6, 2024, 13:49
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Nov 8, 2024
@@ -0,0 +1,45 @@
name: codespell

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
def test_stateless_process_group(worker):
port1 = get_open_port()
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("", port1))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces Medium test

'' binds a socket to all interfaces.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. In this case, we can bind it to the loopback interface 127.0.0.1, which is commonly used for local testing and development. This change will limit the socket to accept connections only from the local machine, reducing the security risks.

Suggested changeset 1
tests/distributed/test_utils.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/distributed/test_utils.py b/tests/distributed/test_utils.py
--- a/tests/distributed/test_utils.py
+++ b/tests/distributed/test_utils.py
@@ -124,3 +124,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
         port2 = get_open_port()
EOF
sock = socket.socket(family=family, type=socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces Medium

'' binds a socket to all interfaces.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to ensure that the socket binds to a specific interface rather than all interfaces. This can be achieved by modifying the create_server_socket function to check if the provided address is empty or 0.0.0.0 and raise an error or use a default specific interface instead.

  1. Modify the create_server_socket function to validate the address.
  2. If the address is empty or 0.0.0.0, raise an error or use a default specific interface.
  3. Update the run_server function to handle the potential error raised by create_server_socket.
Suggested changeset 1
vllm/entrypoints/openai/api_server.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -612,2 +612,5 @@
 def create_server_socket(addr: Tuple[str, int]) -> socket.socket:
+    if addr[0] in ("", "0.0.0.0"):
+        raise ValueError("Binding to all interfaces is not allowed. Please specify a valid IP address.")
+
     family = socket.AF_INET
@@ -640,3 +643,7 @@
     sock_addr = (args.host or "", args.port)
-    sock = create_server_socket(sock_addr)
+    try:
+        sock = create_server_socket(sock_addr)
+    except ValueError as e:
+        logger.error(e)
+        return
 
EOF
# Llama3.2 models more reliable.

TOOL_CALL_REGEX = re.compile(
r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]",

Check failure (this line was flagged 9 times)

Code scanning / CodeQL

Inefficient regular expression High

Parts of this regular expression may cause exponential backtracking: on strings starting with '[' and containing many repetitions of 'AA(),' or 'AA()'; on strings starting with '[A(' and containing many repetitions of 'AA=,', 'AA= ),A(', or 'AA=)A('; and on strings starting with '[A(A=' and containing many repetitions of ',A=' or ')A(A='.
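For context, a minimal sketch (not part of the PR) of why CodeQL flags this pattern: on a crafted non-matching input, the ambiguity between `[a-zA-Z]+` and `\w*` inside the repeated groups, and between the two outer quantified groups, forces the engine to retry a combinatorial number of splits before failing. The repetition counts below are illustrative; match time grows sharply as n increases, so raise it cautiously.

```python
import re
import time

TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]")

for n in (12, 14, 16, 18):
    attack = "[" + "AA()," * n  # no closing ']', so the match must fail
    start = time.perf_counter()
    TOOL_CALL_REGEX.match(attack)
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```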
mfylcek and others added 24 commits November 25, 2024 09:36
Limit decode bucket size to num_hpu_blocks
Signed-off-by: Sanket Kale <[email protected]>
Co-authored-by: Sanket Kale <[email protected]>
Co-authored-by: mgoin <[email protected]>
Fixes issue with multi LoRA during `profile_run`.
We are seeing a 10% performance regression in llama-based models due to
vllm-project#10239. The mark_step()
function needs to be configured differently for each model to achieve
the best performance. For some models, calling mark_step() for every decoder
step would be optimal, but for other models it's better to run it every
n-th step. We are adding a counter to only register the hook for every
n-th step, which can be configured with VLLM_CONFIG_HIDDEN_LAYERS.
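A hedged sketch of how such a counter could look (not the PR's exact code; whether the interval counts decoder layers or decode steps is an assumption based on the VLLM_CONFIG_HIDDEN_LAYERS name):

```python
import os
import habana_frameworks.torch as htorch

# Read the interval once; default to breaking the graph at every step.
MARK_STEP_INTERVAL = int(os.getenv("VLLM_CONFIG_HIDDEN_LAYERS", "1"))

def maybe_mark_step(step_idx: int) -> None:
    # mark_step() flushes the accumulated HPU graph; calling it too often adds
    # launch overhead, while calling it too rarely builds overly large graphs.
    if (step_idx + 1) % MARK_STEP_INTERVAL == 0:
        htorch.core.mark_step()
```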
kzawora-intel and others added 5 commits December 11, 2024 13:11
i think inception was a decent movie overall
Signed-off-by: Konrad Zawora <[email protected]>
With this patch, the mp executor no longer hangs at the end of the application
and exits gracefully out of the box.
New useful checks were added upstream, but we're not running them on habana_main
per PR. This PR fixes that.
@@ -0,0 +1,32 @@
name: Lint documentation

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
kdamaszk and others added 24 commits December 12, 2024 09:41
Without this change we can observe the error below:
```
[rank0]:   File "/software/users/kdamaszke/repos/vllm-fork/vllm/model_executor/models/mllama.py", line 959, in forward
[rank0]:     full_text_row_masked_out_mask = full_text_row_masked_out_mask.view(
[rank0]: RuntimeError: shape '[4, -1, 1]' is invalid for input of size 3
```
It occurs when one of the requests is removed from the batch earlier. In
that case, the language model is still working on shapes padded to the
bucketed batch size, while the encoder input isn't. This change aligns
the batch size of `encoder_seq_lens` to the expected one.
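A minimal sketch of the alignment idea (illustrative only, not the PR's exact code): pad `encoder_seq_lens` up to the bucketed batch size so both sides agree on the batch dimension.

```python
def pad_encoder_seq_lens(encoder_seq_lens: list[int],
                         bucketed_batch_size: int) -> list[int]:
    """Pad with zero-length entries for requests removed from the batch."""
    missing = bucketed_batch_size - len(encoder_seq_lens)
    return encoder_seq_lens + [0] * max(missing, 0)
```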
Now that the one_hot operator has an implementation for eager and compile mode,
the workaround is no longer needed.
Fix for batch size padding in multi-step scheduling by
@SanjuCSudhakaran.

Co-authored-by: Sanju C Sudhakaran <[email protected]>
During warmup, inference mode is used, but at runtime it's
overwritten by no_grad mode; this causes recompilations due to a
dispatch key mismatch in torch.compile.
This switches the no_grad mode to inference_mode from the base class.

---------

Co-authored-by: Rafal Litka <[email protected]>
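A hedged illustration of the fix (not the PR's code): keep warmup and runtime under the same autograd context so torch.compile sees consistent dispatch keys.

```python
import torch

@torch.inference_mode()
def warmup_step(model, dummy_inputs):
    return model(*dummy_inputs)

@torch.inference_mode()  # previously torch.no_grad() here triggered recompiles
def execute_step(model, inputs):
    return model(*inputs)
```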
Generic name discovery for rope.prepare_cos_sin. It fixes errors in
models that don't follow a specific naming hierarchy.
Add a new member to the list of codeowners.
#566 breaks the long-context +
LoRA flow.

It assumes that caching the sin-cos buffer for the first decoder layer is
sufficient to handle all cases, which is not applicable for
long-context + LoRA.

This PR ignores the `_prepare_cos_sin` call prior to HpuModelAdapter forward
in the long-context + LoRA flow.
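A rough sketch of the gating idea (the helper name, flag, and call signature are illustrative assumptions, not the PR's actual interface):

```python
def maybe_prepare_cos_sin(rope, positions, long_context_lora: bool) -> None:
    # In the long-context + LoRA flow, skip the early call and let
    # HpuModelAdapter's forward pass handle sin-cos preparation instead.
    if long_context_lora:
        return
    rope.prepare_cos_sin(positions)  # hypothetical call signature
```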
This PR solves the "ModuleNotFoundError: No module named torch.hpu" in
test_lora_manager_hpu.py::test_from_lora_tensors by importing
"habana_frameworks.hpu" into the LoRA model.

Co-authored-by: Vivek Goel <[email protected]>
This PR updates `test_layers_hpu.py` and `test_lora_hpu.py` to align
with `PunicaWrapper` refactor.

Related PR: #614
Error reported in https://jira.habana-labs.com/browse/SW-212516

Two recently merged PRs were found to break Spec Decode functionality:

1. #491 overrides the existing
WorkerWrapperBase design for speculative decoding.
```
if model_runner_cls is not None:
    ModelRunnerClass = model_runner_cls
```
is no longer needed, since we now use the code below to init model_runner_cls,
following the upstream design.
```
if model_runner_cls is not None:
    self.model_runner = model_runner_cls(self.model_runner)
```

2. #566 does not work in Spec
Decode Eagle mode, because the input tensors now differ from the previous
assumption that decode_fwd provides only one token per sequence; Spec Decode
provides multiple candidate tokens as q.
To fix that, a new ENV, "**VLLM_COS_SIN_RECOMPUTE**=true", was added; it needs
to be set to trigger recomputation of cos and sin for spec decode.

---------

Signed-off-by: Chendi.Xue <[email protected]>
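A hedged sketch of how the new flag could gate the recomputation (the helper and its signature are illustrative; only the VLLM_COS_SIN_RECOMPUTE name comes from the description above):

```python
import os

RECOMPUTE_COS_SIN = os.getenv("VLLM_COS_SIN_RECOMPUTE", "false").lower() in ("1", "true")

def get_cos_sin(rope, positions, cached=None):
    # With speculative decoding, q carries several candidate tokens per
    # sequence, so a cache shaped for single-token decode would be stale.
    if cached is None or RECOMPUTE_COS_SIN:
        cached = rope.prepare_cos_sin(positions)  # hypothetical signature
    return cached
```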
This PR fixes the slow sampling on HPU when repetition_penalty is
set in the sampling parameters.

It replaces the slow PyTorch API on HPU and mitigates the dynamic shapes
in the code.

Without this PR:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None, guided_decoding=None)
Warming up...
Profiling iterations: 100%|5/5 [03:32<00:00, 42.49s/it]
Avg latency: 42.49439047839987 seconds
10% percentile latency: 11.322476224999628 seconds
25% percentile latency: 11.32563829100036 seconds
50% percentile latency: 11.331052645000455 seconds
75% percentile latency: 11.333669468998778 seconds
90% percentile latency: 104.8302020711999 seconds
99% percentile latency: 160.92812163252054 seconds

With PR:
Avg latency: 11.038154767800005 seconds
10% percentile latency: 10.964674918200398 seconds
25% percentile latency: 10.964709408001 seconds
50% percentile latency: 10.966433088000485 seconds
75% percentile latency: 10.967024742998547 seconds
90% percentile latency: 11.18358270219942 seconds
99% percentile latency: 11.313517477719943 seconds

Testing code:

https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh

The only difference between this PR and
#442 is that I do not enable
pin_memory, as this feature's readiness is poor on HPU.
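For illustration, one common way to avoid dynamic shapes in this kind of penalty kernel is to work on a fixed [batch, vocab] mask instead of data-dependent index lists. This is a hedged sketch of that idea, not necessarily the PR's exact change (padding handling is simplified):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             seen_token_ids: torch.Tensor,
                             penalty: float) -> torch.Tensor:
    # logits: [batch, vocab]; seen_token_ids: [batch, max_seen] (padded).
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask.scatter_(1, seen_token_ids, True)
    # Standard repetition penalty: shrink positive logits, grow negative ones.
    penalized = torch.where(logits > 0, logits / penalty, logits * penalty)
    return torch.where(mask, penalized, logits)
```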
Original: #599

We have a case where topk=1 and topp<1.

Adding special handling for the case topk=1 and handle_duplicate=0 (by
default handle_duplicate=0, to support num-scheduling-steps).
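A minimal sketch of why topk=1 deserves a fast path (illustrative, not the sampler's actual code): with top_k == 1 the filtered distribution collapses to the argmax token, so the general sort-and-cumsum top-k/top-p path can be skipped.

```python
import torch

def sample_topk1(logits: torch.Tensor) -> torch.Tensor:
    # logits: [batch, vocab] -> next-token ids: [batch]
    return torch.argmax(logits, dim=-1)
```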
This PR fixes a bug that results in the following RuntimeError when APC
is enabled.
```
ERROR 12-19 02:30:05 engine.py:140]   File "/workspace/vllm/worker/hpu_model_runner.py", line 854, in _prepare_prompt
ERROR 12-19 02:30:05 engine.py:140]     if prefix_block_list_tensor:
ERROR 12-19 02:30:05 engine.py:140] RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
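The failing check and the usual fix, as a hedged illustration (truth-testing a multi-element tensor is what raises the RuntimeError above):

```python
import torch

prefix_block_list_tensor = torch.tensor([1, 2, 3])

# if prefix_block_list_tensor:  # RuntimeError: Boolean value of Tensor with
#     ...                       # more than one value is ambiguous

if prefix_block_list_tensor is not None and prefix_block_list_tensor.numel() > 0:
    pass  # explicit emptiness check instead of implicit bool()
```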
Add LLaVA support with a prompt for benchmarking throughput using images.
This is an updated version of
#650.


Coupled with [Use FusedSDPA for MllamaVisionSdpaAttention
#620], it resolves two issues
that arise when running the Llama 3.2 vision model:

GC fail when batch size > 1 on Gaudi3.
Increased device memory consumption with Torch 2.5 compared to Torch
2.4.

---------

Signed-off-by: yan ma <[email protected]>
Co-authored-by: yisonzhu <[email protected]>
Use `FusedSDPA` instead of the regular `F.scaled_dot_product_attention` in the
`MllamaVisionSdpaAttention` module.

The difference between these two ops is precision:
`F.scaled_dot_product_attention` converts the input to float32 and
performs all operations in this data type, while `FusedSDPA` does not.
However, it changes accuracy only from 0.449 to 0.446 on an accuracy test
based on the MMMU dataset and lm-evaluation-harness, while improving
single-prompt performance from ~550ms to ~100ms.
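A hedged sketch of the swap (the FusedSDPA import path and argument order follow Habana's kernel API as I understand it; the exact signature may vary between SynapseAI releases):

```python
import torch.nn.functional as F
from habana_frameworks.torch.hpex.kernels import FusedSDPA

def vision_sdpa(q, k, v, attn_mask=None, use_fused=True):
    if use_fused:
        # Keeps the computation in the input dtype (no fp32 upcast), trading a
        # small accuracy delta for much lower latency, per the numbers above.
        return FusedSDPA.apply(q, k, v, attn_mask, 0.0, False)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```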
Fix warmup for the encoder-decoder model runner by limiting the number of
dummy cross-attention blocks to the available blocks. Without this we will
encounter an error in CrossAttention due to the lack of available blocks.
Remove the workaround using torch.ops.hpu.fp8_gemm_v2 for HPU.
This PR adds changes required to enable MSS with LoRA flow. Checked
there are no regressions using vllm-fork CI job
https://tf-jenkins-ctrl01.habana-labs.com/job/vLLM/view/CI-jobs/job/vLLM-CI-Pipeline/429/
- Added actionlint.yaml to allow usage of self-hosted runners (without
it, actionlint will throw an error). I also tried to disable some of the
shellcheck warnings/errors but couldn't, so this PR should probably be
merged even though actionlint is failing.
- Updated the Trigger Jenkins workflow; it now contains 4 jobs:
1. Dependency Scan - fails the job if a dependency with a high-severity
vulnerability is part of the PR
2. CodeQL Scan - scans the Python code itself
3. Calculate Tests To Trigger - reads the .jenkins/test_config.yaml
file and triggers all the tests configured in it
4. Tests - the tests running on Gaudi resources